Assignment 1 - Exploratory Data Analysis

Author

Hanin Almodaweb

Assignment Description

We will work with air pollution data from the U.S. Environmental Protection Agency (EPA). The EPA has a national monitoring network of air pollution sites. The primary question you will answer is whether daily concentrations of PM\(_{2.5}\) (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).

Steps

  1. Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.
# reading data into R
EPA2002 <- data.table::fread("/Users/neens/Downloads/ad_viz_plotval_data.csv")
EPA2022 <- data.table::fread("/Users/neens/Downloads/ad_viz_plotval_data-2.csv")
# checking the California 2002 data set
dim(EPA2002)
[1] 15976    22
head(EPA2002)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 01/05/2002    AQS 60010007     1                           25.1 ug/m3 LC
2: 01/06/2002    AQS 60010007     1                           31.6 ug/m3 LC
3: 01/08/2002    AQS 60010007     1                           21.4 ug/m3 LC
4: 01/11/2002    AQS 60010007     1                           25.9 ug/m3 LC
5: 01/14/2002    AQS 60010007     1                           34.5 ug/m3 LC
6: 01/17/2002    AQS 60010007     1                           41.0 ug/m3 LC
   Daily AQI Value Local Site Name Daily Obs Count Percent Complete
             <int>          <char>           <int>            <num>
1:              81       Livermore               1              100
2:              93       Livermore               1              100
3:              74       Livermore               1              100
4:              82       Livermore               1              100
5:              98       Livermore               1              100
6:             115       Livermore               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         120
2:              88101  PM2.5 - Local Conditions         120
3:              88101  PM2.5 - Local Conditions         120
4:              88101  PM2.5 - Local Conditions         120
5:              88101  PM2.5 - Local Conditions         120
6:              88101  PM2.5 - Local Conditions         120
                      Method Description CBSA Code
                                  <char>     <int>
1: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
2: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
3: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
4: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
5: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
6: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
                           CBSA Name State FIPS Code      State
                              <char>           <int>     <char>
1: San Francisco-Oakland-Hayward, CA               6 California
2: San Francisco-Oakland-Hayward, CA               6 California
3: San Francisco-Oakland-Hayward, CA               6 California
4: San Francisco-Oakland-Hayward, CA               6 California
5: San Francisco-Oakland-Hayward, CA               6 California
6: San Francisco-Oakland-Hayward, CA               6 California
   County FIPS Code  County Site Latitude Site Longitude
              <int>  <char>         <num>          <num>
1:                1 Alameda      37.68753      -121.7842
2:                1 Alameda      37.68753      -121.7842
3:                1 Alameda      37.68753      -121.7842
4:                1 Alameda      37.68753      -121.7842
5:                1 Alameda      37.68753      -121.7842
6:                1 Alameda      37.68753      -121.7842
tail(EPA2002)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 12/10/2002    AQS 61131003     1                             15 ug/m3 LC
2: 12/13/2002    AQS 61131003     1                             15 ug/m3 LC
3: 12/22/2002    AQS 61131003     1                              1 ug/m3 LC
4: 12/25/2002    AQS 61131003     1                             23 ug/m3 LC
5: 12/28/2002    AQS 61131003     1                              5 ug/m3 LC
6: 12/31/2002    AQS 61131003     1                              6 ug/m3 LC
   Daily AQI Value      Local Site Name Daily Obs Count Percent Complete
             <int>               <char>           <int>            <num>
1:              62 Woodland-Gibson Road               1              100
2:              62 Woodland-Gibson Road               1              100
3:               6 Woodland-Gibson Road               1              100
4:              77 Woodland-Gibson Road               1              100
5:              28 Woodland-Gibson Road               1              100
6:              33 Woodland-Gibson Road               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         117
2:              88101  PM2.5 - Local Conditions         117
3:              88101  PM2.5 - Local Conditions         117
4:              88101  PM2.5 - Local Conditions         117
5:              88101  PM2.5 - Local Conditions         117
6:              88101  PM2.5 - Local Conditions         117
                      Method Description CBSA Code
                                  <char>     <int>
1: R & P Model 2000 PM2.5 Sampler w/WINS     40900
2: R & P Model 2000 PM2.5 Sampler w/WINS     40900
3: R & P Model 2000 PM2.5 Sampler w/WINS     40900
4: R & P Model 2000 PM2.5 Sampler w/WINS     40900
5: R & P Model 2000 PM2.5 Sampler w/WINS     40900
6: R & P Model 2000 PM2.5 Sampler w/WINS     40900
                                 CBSA Name State FIPS Code      State
                                    <char>           <int>     <char>
1: Sacramento--Roseville--Arden-Arcade, CA               6 California
2: Sacramento--Roseville--Arden-Arcade, CA               6 California
3: Sacramento--Roseville--Arden-Arcade, CA               6 California
4: Sacramento--Roseville--Arden-Arcade, CA               6 California
5: Sacramento--Roseville--Arden-Arcade, CA               6 California
6: Sacramento--Roseville--Arden-Arcade, CA               6 California
   County FIPS Code County Site Latitude Site Longitude
              <int> <char>         <num>          <num>
1:              113   Yolo      38.66121      -121.7327
2:              113   Yolo      38.66121      -121.7327
3:              113   Yolo      38.66121      -121.7327
4:              113   Yolo      38.66121      -121.7327
5:              113   Yolo      38.66121      -121.7327
6:              113   Yolo      38.66121      -121.7327
# checking variable names and variable types for the 2002 data set
str(EPA2002)
Classes 'data.table' and 'data.frame':  15976 obs. of  22 variables:
 $ Date                          : chr  "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Daily Mean PM2.5 Concentration: num  25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily AQI Value               : int  81 93 74 82 98 115 89 62 69 107 ...
 $ Local Site Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily Obs Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS Parameter Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS Parameter Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method Code                   : int  120 120 120 120 120 120 120 120 120 120 ...
 $ Method Description            : chr  "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" ...
 $ CBSA Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State FIPS Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County FIPS Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site Longitude                : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
# checking for data issues
# all variables
summary(EPA2002)
     Date              Source             Site ID              POC       
 Length:15976       Length:15976       Min.   :60010007   Min.   :1.000  
 Class :character   Class :character   1st Qu.:60290014   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :60590007   Median :1.000  
                                       Mean   :60549600   Mean   :1.581  
                                       3rd Qu.:60731002   3rd Qu.:1.000  
                                       Max.   :61131003   Max.   :6.000  
                                                                         
 Daily Mean PM2.5 Concentration    Units           Daily AQI Value 
 Min.   :  0.00                 Length:15976       Min.   :  0.00  
 1st Qu.:  7.00                 Class :character   1st Qu.: 39.00  
 Median : 12.00                 Mode  :character   Median : 56.00  
 Mean   : 16.12                                    Mean   : 59.28  
 3rd Qu.: 20.50                                    3rd Qu.: 72.00  
 Max.   :104.30                                    Max.   :185.00  
                                                                   
 Local Site Name    Daily Obs Count Percent Complete AQS Parameter Code
 Length:15976       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88215     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88502     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS Parameter Description  Method Code  Method Description   CBSA Code    
 Length:15976              Min.   :117   Length:15976       Min.   :12540  
 Class :character          1st Qu.:120   Class :character   1st Qu.:23420  
 Mode  :character          Median :120   Mode  :character   Median :40140  
                           Mean   :297                      Mean   :33270  
                           3rd Qu.:707                      3rd Qu.:41740  
                           Max.   :810                      Max.   :49700  
                                                            NA's   :929    
  CBSA Name         State FIPS Code    State           County FIPS Code
 Length:15976       Min.   :6       Length:15976       Min.   :  1.00  
 Class :character   1st Qu.:6       Class :character   1st Qu.: 29.00  
 Mode  :character   Median :6       Mode  :character   Median : 59.00  
                    Mean   :6                          Mean   : 54.78  
                    3rd Qu.:6                          3rd Qu.: 73.00  
                    Max.   :6                          Max.   :113.00  
                                                                       
    County          Site Latitude   Site Longitude  
 Length:15976       Min.   :32.63   Min.   :-124.2  
 Class :character   1st Qu.:34.07   1st Qu.:-121.4  
 Mode  :character   Median :35.36   Median :-119.1  
                    Mean   :36.00   Mean   :-119.4  
                    3rd Qu.:37.77   3rd Qu.:-117.9  
                    Max.   :41.71   Max.   :-115.5  
                                                    
# daily Mean PM2.5 Concentration variable
summary(EPA2002$`Daily Mean PM2.5 Concentration`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    7.00   12.00   16.12   20.50  104.30 
# missing values
anyNA(EPA2002)
[1] TRUE
mean(is.na(EPA2002$`Daily Mean PM2.5 Concentration`))
[1] 0
Summary of 2002 Findings

The 2002 data set consists of 15,976 rows and 22 columns (variables), and the first and last rows (head and tail) show no apparent problems. Initial checks of the data structure indicated a mix of character, integer, and numeric data types. The variable names are Date, Source, Site ID, POC, Daily Mean PM\(_{2.5}\) Concentration, Units, Daily AQI Value, Local Site Name, Daily Obs Count, Percent Complete, AQS Parameter Code, AQS Parameter Description, Method Code, Method Description, CBSA Code, CBSA Name, State FIPS Code, State, County FIPS Code, County, Site Latitude, and Site Longitude. The character variables of interest are Date, State, and County, while the numeric variables under study are Daily Mean PM\(_{2.5}\) Concentration, Site Latitude, and Site Longitude.

The Daily Mean PM\(_{2.5}\) Concentration values range from 0 to 104.3 µg/m³, with a mean of 16.12 µg/m³ and a median of 12 µg/m³. (The value of 185 in the summary is the maximum Daily AQI Value, not a concentration.) There are no missing values in the Daily Mean PM\(_{2.5}\) Concentration column, so the key variable of interest is complete for analysis. The data set is not entirely complete, however: `anyNA()` returns TRUE, and the summary shows 929 missing values in CBSA Code. Because the mean exceeds the median and the maximum lies far above the third quartile (20.5 µg/m³), the distribution appears right-skewed, so the high PM\(_{2.5}\) values should be examined as potential outliers.
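To see which columns account for the missing values that `anyNA()` reports, one option is to count the NAs in each column. The following is a minimal sketch on a small made-up table: `toy` and its values are hypothetical stand-ins for `EPA2002`, and only base R functions are used (they also work on a data.table).

```r
# hypothetical stand-in for EPA2002 with a few columns (values invented)
toy <- data.frame(
  PM25 = c(25.1, 31.6, 21.4, 25.9),
  CBSA_Code = c(41860L, NA, 41860L, NA),
  County = c("Alameda", "Alameda", "Yolo", "Yolo")
)

# number of missing values in each column
na_counts <- colSums(is.na(toy))

# keep only the columns that actually contain missing values
na_counts[na_counts > 0]
```

Applied to the real data, the same idea would be `colSums(is.na(EPA2002))`.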

# checking the California 2022 data set
dim(EPA2022)
[1] 59756    22
head(EPA2022)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 01/01/2022    AQS 60010007     3                           12.7 ug/m3 LC
2: 01/02/2022    AQS 60010007     3                           13.9 ug/m3 LC
3: 01/03/2022    AQS 60010007     3                            7.1 ug/m3 LC
4: 01/04/2022    AQS 60010007     3                            3.7 ug/m3 LC
5: 01/05/2022    AQS 60010007     3                            4.2 ug/m3 LC
6: 01/06/2022    AQS 60010007     3                            3.8 ug/m3 LC
   Daily AQI Value Local Site Name Daily Obs Count Percent Complete
             <int>          <char>           <int>            <num>
1:              58       Livermore               1              100
2:              60       Livermore               1              100
3:              39       Livermore               1              100
4:              21       Livermore               1              100
5:              23       Livermore               1              100
6:              21       Livermore               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         170
2:              88101  PM2.5 - Local Conditions         170
3:              88101  PM2.5 - Local Conditions         170
4:              88101  PM2.5 - Local Conditions         170
5:              88101  PM2.5 - Local Conditions         170
6:              88101  PM2.5 - Local Conditions         170
                     Method Description CBSA Code
                                 <char>     <int>
1: Met One BAM-1020 Mass Monitor w/VSCC     41860
2: Met One BAM-1020 Mass Monitor w/VSCC     41860
3: Met One BAM-1020 Mass Monitor w/VSCC     41860
4: Met One BAM-1020 Mass Monitor w/VSCC     41860
5: Met One BAM-1020 Mass Monitor w/VSCC     41860
6: Met One BAM-1020 Mass Monitor w/VSCC     41860
                           CBSA Name State FIPS Code      State
                              <char>           <int>     <char>
1: San Francisco-Oakland-Hayward, CA               6 California
2: San Francisco-Oakland-Hayward, CA               6 California
3: San Francisco-Oakland-Hayward, CA               6 California
4: San Francisco-Oakland-Hayward, CA               6 California
5: San Francisco-Oakland-Hayward, CA               6 California
6: San Francisco-Oakland-Hayward, CA               6 California
   County FIPS Code  County Site Latitude Site Longitude
              <int>  <char>         <num>          <num>
1:                1 Alameda      37.68753      -121.7842
2:                1 Alameda      37.68753      -121.7842
3:                1 Alameda      37.68753      -121.7842
4:                1 Alameda      37.68753      -121.7842
5:                1 Alameda      37.68753      -121.7842
6:                1 Alameda      37.68753      -121.7842
tail(EPA2022)
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 12/01/2022    AQS 61131003     1                            3.4 ug/m3 LC
2: 12/07/2022    AQS 61131003     1                            3.8 ug/m3 LC
3: 12/13/2022    AQS 61131003     1                            6.0 ug/m3 LC
4: 12/19/2022    AQS 61131003     1                           34.8 ug/m3 LC
5: 12/25/2022    AQS 61131003     1                           23.2 ug/m3 LC
6: 12/31/2022    AQS 61131003     1                            1.0 ug/m3 LC
   Daily AQI Value      Local Site Name Daily Obs Count Percent Complete
             <int>               <char>           <int>            <num>
1:              19 Woodland-Gibson Road               1              100
2:              21 Woodland-Gibson Road               1              100
3:              33 Woodland-Gibson Road               1              100
4:              99 Woodland-Gibson Road               1              100
5:              77 Woodland-Gibson Road               1              100
6:               6 Woodland-Gibson Road               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         145
2:              88101  PM2.5 - Local Conditions         145
3:              88101  PM2.5 - Local Conditions         145
4:              88101  PM2.5 - Local Conditions         145
5:              88101  PM2.5 - Local Conditions         145
6:              88101  PM2.5 - Local Conditions         145
                                      Method Description CBSA Code
                                                  <char>     <int>
1: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
2: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
3: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
4: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
5: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
6: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                 CBSA Name State FIPS Code      State
                                    <char>           <int>     <char>
1: Sacramento--Roseville--Arden-Arcade, CA               6 California
2: Sacramento--Roseville--Arden-Arcade, CA               6 California
3: Sacramento--Roseville--Arden-Arcade, CA               6 California
4: Sacramento--Roseville--Arden-Arcade, CA               6 California
5: Sacramento--Roseville--Arden-Arcade, CA               6 California
6: Sacramento--Roseville--Arden-Arcade, CA               6 California
   County FIPS Code County Site Latitude Site Longitude
              <int> <char>         <num>          <num>
1:              113   Yolo      38.66121      -121.7327
2:              113   Yolo      38.66121      -121.7327
3:              113   Yolo      38.66121      -121.7327
4:              113   Yolo      38.66121      -121.7327
5:              113   Yolo      38.66121      -121.7327
6:              113   Yolo      38.66121      -121.7327
# checking variable names and variable types for the 2022 data set
str(EPA2022)
Classes 'data.table' and 'data.frame':  59756 obs. of  22 variables:
 $ Date                          : chr  "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Daily Mean PM2.5 Concentration: num  12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily AQI Value               : int  58 60 39 21 23 21 13 38 59 55 ...
 $ Local Site Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily Obs Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS Parameter Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS Parameter Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method Code                   : int  170 170 170 170 170 170 170 170 170 170 ...
 $ Method Description            : chr  "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
 $ CBSA Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State FIPS Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County FIPS Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site Longitude                : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
# checking for data issues 
# all variables
summary(EPA2022)
     Date              Source             Site ID              POC       
 Length:59756       Length:59756       Min.   :60010007   Min.   : 1.00  
 Class :character   Class :character   1st Qu.:60290019   1st Qu.: 1.00  
 Mode  :character   Mode  :character   Median :60631006   Median : 3.00  
                                       Mean   :60563315   Mean   : 3.77  
                                       3rd Qu.:60731026   3rd Qu.: 3.00  
                                       Max.   :61131003   Max.   :24.00  
                                                                         
 Daily Mean PM2.5 Concentration    Units           Daily AQI Value 
 Min.   : -6.700                Length:59756       Min.   :  0.00  
 1st Qu.:  4.100                Class :character   1st Qu.: 23.00  
 Median :  6.800                Mode  :character   Median : 38.00  
 Mean   :  8.429                                   Mean   : 39.28  
 3rd Qu.: 10.700                                   3rd Qu.: 54.00  
 Max.   :302.500                                   Max.   :454.00  
                                                                   
 Local Site Name    Daily Obs Count Percent Complete AQS Parameter Code
 Length:59756       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88192     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88101     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS Parameter Description  Method Code    Method Description   CBSA Code    
 Length:59756              Min.   :143.0   Length:59756       Min.   :12540  
 Class :character          1st Qu.:170.0   Class :character   1st Qu.:31080  
 Mode  :character          Median :170.0   Mode  :character   Median :40140  
                           Mean   :336.1                      Mean   :34957  
                           3rd Qu.:707.0                      3rd Qu.:41860  
                           Max.   :810.0                      Max.   :49700  
                                                              NA's   :4567   
  CBSA Name         State FIPS Code    State           County FIPS Code
 Length:59756       Min.   :6       Length:59756       Min.   :  1.00  
 Class :character   1st Qu.:6       Class :character   1st Qu.: 29.00  
 Mode  :character   Median :6       Mode  :character   Median : 63.00  
                    Mean   :6                          Mean   : 56.19  
                    3rd Qu.:6                          3rd Qu.: 73.00  
                    Max.   :6                          Max.   :113.00  
                                                                       
    County          Site Latitude   Site Longitude  
 Length:59756       Min.   :32.58   Min.   :-124.2  
 Class :character   1st Qu.:34.07   1st Qu.:-121.4  
 Mode  :character   Median :36.49   Median :-119.6  
                    Mean   :36.24   Mean   :-119.6  
                    3rd Qu.:37.96   3rd Qu.:-117.9  
                    Max.   :41.76   Max.   :-115.5  
                                                    
# daily Mean PM2.5 Concentration variable
summary(EPA2022$`Daily Mean PM2.5 Concentration`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -6.700   4.100   6.800   8.429  10.700 302.500 
# missing values
anyNA(EPA2022)
[1] TRUE
mean(is.na(EPA2022$`Daily Mean PM2.5 Concentration`))
[1] 0
Summary of 2022 Findings

The 2022 data set contains 59,756 rows and 22 columns (variables), with the headers and footers loaded correctly. There is evidence of missing data (4,567 missing values in CBSA Code), though not in the main variable of interest, Daily Mean PM\(_{2.5}\) Concentration. The variable names and types remain consistent with the 2002 data set. The Daily Mean PM\(_{2.5}\) Concentration values range from -6.7 to 302.5 µg/m³, with a mean of 8.43 µg/m³ and a median of 6.8 µg/m³. The negative values are worth noting: a concentration of particulate matter cannot physically be below zero, so a negative value might indicate an issue with data collection, sensor calibration, or data processing.
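Before deciding how to handle the negative readings, it may help to know how many there are and what share of the data they represent. The sketch below uses an invented vector `pm25` as a hypothetical stand-in for the `Daily Mean PM2.5 Concentration` column of `EPA2022`; the numbers are illustrative only.

```r
# hypothetical PM2.5 readings in ug/m3 (values invented for illustration)
pm25 <- c(12.7, -6.7, 3.8, -1.2, 302.5, 6.8, 0.0, 4.1)

# how many readings are below zero, and what share of all readings that is
n_negative <- sum(pm25 < 0)
share_negative <- mean(pm25 < 0)

# the negative readings themselves, for inspection
negative_values <- pm25[pm25 < 0]

n_negative
share_negative
negative_values
```

If the share turns out to be small, dropping or flagging those rows is unlikely to change the overall comparison much, but that is a judgment call that depends on the actual counts.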

  2. Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
# combining the two data sets
EPA_combined <- rbind(EPA2002, EPA2022, fill = TRUE)
# converting date to date format
EPA_combined$Date <- as.Date(EPA_combined$Date, format = "%m/%d/%Y")

# creating a 'Year' column from the date
EPA_combined$Year <- format(EPA_combined$Date, "%Y")
# renaming the columns of key variables
data.table::setnames(EPA_combined,
  old = c("Daily Mean PM2.5 Concentration", "Daily AQI Value",
          "Site ID", "Site Latitude", "Site Longitude"),
  new = c("PM2.5", "AQI", "Site_ID", "Latitude", "Longitude"))
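A quick sanity check after combining is to confirm that no rows were lost, that every date parsed, and that the new year column splits the rows as expected. The following is a minimal sketch on two invented miniature tables; `d2002`, `d2022`, and `combined` are hypothetical names, and base R is used so the example runs without extra packages.

```r
# hypothetical miniature versions of the two yearly tables (values invented)
d2002 <- data.frame(Date = c("01/05/2002", "01/06/2002"),
                    PM25 = c(25.1, 31.6))
d2022 <- data.frame(Date = c("01/01/2022", "01/02/2022", "01/03/2022"),
                    PM25 = c(12.7, 13.9, 7.1))

# same steps as above: stack the tables, parse the date, extract the year
combined <- rbind(d2002, d2022)
combined$Date <- as.Date(combined$Date, format = "%m/%d/%Y")
combined$Year <- format(combined$Date, "%Y")

# the combined table should have as many rows as the two inputs together
stopifnot(nrow(combined) == nrow(d2002) + nrow(d2022))

# no date should have failed to parse (failed parses become NA)
stopifnot(!anyNA(combined$Date))

# rows per year; these should match the row counts of the individual tables
rows_per_year <- table(combined$Year)
rows_per_year
```

For the real data, the expected total is 15,976 + 59,756 = 75,732 rows, which agrees with the length reported by `summary(EPA_combined)`.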
# checking the new data set
summary(EPA_combined)
      Date               Source             Site_ID              POC        
 Min.   :2002-01-01   Length:75732       Min.   :60010007   Min.   : 1.000  
 1st Qu.:2022-01-19   Class :character   1st Qu.:60290016   1st Qu.: 1.000  
 Median :2022-05-14   Mode  :character   Median :60612003   Median : 3.000  
 Mean   :2018-04-13                      Mean   :60560422   Mean   : 3.309  
 3rd Qu.:2022-09-09                      3rd Qu.:60731022   3rd Qu.: 3.000  
 Max.   :2022-12-31                      Max.   :61131003   Max.   :24.000  
                                                                            
     PM2.5           Units                AQI        Local Site Name   
 Min.   : -6.70   Length:75732       Min.   :  0.0   Length:75732      
 1st Qu.:  4.50   Class :character   1st Qu.: 25.0   Class :character  
 Median :  7.60   Mode  :character   Median : 42.0   Mode  :character  
 Mean   : 10.05                      Mean   : 43.5                     
 3rd Qu.: 12.20                      3rd Qu.: 57.0                     
 Max.   :302.50                      Max.   :454.0                     
                                                                       
 Daily Obs Count Percent Complete AQS Parameter Code AQS Parameter Description
 Min.   :1       Min.   :100      Min.   :88101      Length:75732             
 1st Qu.:1       1st Qu.:100      1st Qu.:88101      Class :character         
 Median :1       Median :100      Median :88101      Mode  :character         
 Mean   :1       Mean   :100      Mean   :88197                               
 3rd Qu.:1       3rd Qu.:100      3rd Qu.:88101                               
 Max.   :1       Max.   :100      Max.   :88502                               
                                                                              
  Method Code    Method Description   CBSA Code      CBSA Name        
 Min.   :117.0   Length:75732       Min.   :12540   Length:75732      
 1st Qu.:170.0   Class :character   1st Qu.:31080   Class :character  
 Median :170.0   Mode  :character   Median :40140   Mode  :character  
 Mean   :327.8                      Mean   :34595                     
 3rd Qu.:707.0                      3rd Qu.:41740                     
 Max.   :810.0                      Max.   :49700                     
                                    NA's   :5496                      
 State FIPS Code    State           County FIPS Code    County         
 Min.   :6       Length:75732       Min.   :  1.00   Length:75732      
 1st Qu.:6       Class :character   1st Qu.: 29.00   Class :character  
 Median :6       Mode  :character   Median : 61.00   Mode  :character  
 Mean   :6                          Mean   : 55.89                     
 3rd Qu.:6                          3rd Qu.: 73.00                     
 Max.   :6                          Max.   :113.00                     
                                                                       
    Latitude       Longitude          Year          
 Min.   :32.58   Min.   :-124.2   Length:75732      
 1st Qu.:34.07   1st Qu.:-121.4   Class :character  
 Median :36.48   Median :-119.3   Mode  :character  
 Mean   :36.19   Mean   :-119.5                     
 3rd Qu.:37.96   3rd Qu.:-117.9                     
 Max.   :41.76   Max.   :-115.5                     
                                                    
head(EPA_combined)
         Date Source  Site_ID   POC PM2.5    Units   AQI Local Site Name
       <Date> <char>    <int> <int> <num>   <char> <int>          <char>
1: 2002-01-05    AQS 60010007     1  25.1 ug/m3 LC    81       Livermore
2: 2002-01-06    AQS 60010007     1  31.6 ug/m3 LC    93       Livermore
3: 2002-01-08    AQS 60010007     1  21.4 ug/m3 LC    74       Livermore
4: 2002-01-11    AQS 60010007     1  25.9 ug/m3 LC    82       Livermore
5: 2002-01-14    AQS 60010007     1  34.5 ug/m3 LC    98       Livermore
6: 2002-01-17    AQS 60010007     1  41.0 ug/m3 LC   115       Livermore
   Daily Obs Count Percent Complete AQS Parameter Code
             <int>            <num>              <int>
1:               1              100              88101
2:               1              100              88101
3:               1              100              88101
4:               1              100              88101
5:               1              100              88101
6:               1              100              88101
   AQS Parameter Description Method Code                    Method Description
                      <char>       <int>                                <char>
1:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
2:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
3:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
4:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
5:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
6:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
   CBSA Code                         CBSA Name State FIPS Code      State
       <int>                            <char>           <int>     <char>
1:     41860 San Francisco-Oakland-Hayward, CA               6 California
2:     41860 San Francisco-Oakland-Hayward, CA               6 California
3:     41860 San Francisco-Oakland-Hayward, CA               6 California
4:     41860 San Francisco-Oakland-Hayward, CA               6 California
5:     41860 San Francisco-Oakland-Hayward, CA               6 California
6:     41860 San Francisco-Oakland-Hayward, CA               6 California
   County FIPS Code  County Latitude Longitude   Year
              <int>  <char>    <num>     <num> <char>
1:                1 Alameda 37.68753 -121.7842   2002
2:                1 Alameda 37.68753 -121.7842   2002
3:                1 Alameda 37.68753 -121.7842   2002
4:                1 Alameda 37.68753 -121.7842   2002
5:                1 Alameda 37.68753 -121.7842   2002
6:                1 Alameda 37.68753 -121.7842   2002
tail(EPA_combined)
         Date Source  Site_ID   POC PM2.5    Units   AQI      Local Site Name
       <Date> <char>    <int> <int> <num>   <char> <int>               <char>
1: 2022-12-01    AQS 61131003     1   3.4 ug/m3 LC    19 Woodland-Gibson Road
2: 2022-12-07    AQS 61131003     1   3.8 ug/m3 LC    21 Woodland-Gibson Road
3: 2022-12-13    AQS 61131003     1   6.0 ug/m3 LC    33 Woodland-Gibson Road
4: 2022-12-19    AQS 61131003     1  34.8 ug/m3 LC    99 Woodland-Gibson Road
5: 2022-12-25    AQS 61131003     1  23.2 ug/m3 LC    77 Woodland-Gibson Road
6: 2022-12-31    AQS 61131003     1   1.0 ug/m3 LC     6 Woodland-Gibson Road
   Daily Obs Count Percent Complete AQS Parameter Code
             <int>            <num>              <int>
1:               1              100              88101
2:               1              100              88101
3:               1              100              88101
4:               1              100              88101
5:               1              100              88101
6:               1              100              88101
   AQS Parameter Description Method Code
                      <char>       <int>
1:  PM2.5 - Local Conditions         145
2:  PM2.5 - Local Conditions         145
3:  PM2.5 - Local Conditions         145
4:  PM2.5 - Local Conditions         145
5:  PM2.5 - Local Conditions         145
6:  PM2.5 - Local Conditions         145
                                      Method Description CBSA Code
                                                  <char>     <int>
1: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
2: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
3: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
4: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
5: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
6: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                 CBSA Name State FIPS Code      State
                                    <char>           <int>     <char>
1: Sacramento--Roseville--Arden-Arcade, CA               6 California
2: Sacramento--Roseville--Arden-Arcade, CA               6 California
3: Sacramento--Roseville--Arden-Arcade, CA               6 California
4: Sacramento--Roseville--Arden-Arcade, CA               6 California
5: Sacramento--Roseville--Arden-Arcade, CA               6 California
6: Sacramento--Roseville--Arden-Arcade, CA               6 California
   County FIPS Code County Latitude Longitude   Year
              <int> <char>    <num>     <num> <char>
1:              113   Yolo 38.66121 -121.7327   2022
2:              113   Yolo 38.66121 -121.7327   2022
3:              113   Yolo 38.66121 -121.7327   2022
4:              113   Yolo 38.66121 -121.7327   2022
5:              113   Yolo 38.66121 -121.7327   2022
6:              113   Yolo 38.66121 -121.7327   2022
  3. Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
# ensuring the data is in the correct format
EPA_combined$Year <- as.numeric(EPA_combined$Year)

# defining a color palette for the years (2002 and 2022)
palette <- colorFactor(palette = c("turquoise", "pink"), domain = EPA_combined$Year)

# creating the leaflet map
leaflet(EPA_combined) %>%
  addTiles() %>%  
  addCircleMarkers(
    ~Longitude, ~Latitude,  # Set the longitude and latitude
    color = ~palette(Year), # Use different colors for each year
    popup = ~paste("Site ID:", Site_ID, "<br>", 
                   "Year:", Year, "<br>",
                   "PM2.5:", PM2.5, "<br>",
                   "AQI:", AQI),  # Popup information
    radius = 5, fillOpacity = 0.8, stroke = FALSE
  ) %>%
  addLegend(
    "bottomright", 
    pal = palette, 
    values = ~Year, 
    title = "Monitoring Year",
    opacity = 1
  )
Summary of the spatial distribution of the monitoring sites

In 2002, monitoring sites were mainly concentrated around major cities like Los Angeles, San Francisco, and Sacramento, with less coverage in central and eastern regions. By 2022, the number of monitoring sites increased, especially in previously underrepresented areas, indicating an expansion of air quality monitoring infrastructure over the two decades.
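
The claim that the number of sites grew can be checked by counting distinct site IDs per year. The sketch below is a minimal illustration on a small made-up table (`toy`, with hypothetical site IDs and coordinates); in this report the same calls would be applied to `EPA_combined`. It also shows one possible way to reduce the data to one row per site and year before mapping, so that each site is drawn once rather than once per daily observation.

```r
library(data.table)

# toy stand-in for EPA_combined; site IDs and coordinates are made up
toy <- data.table(
  Site_ID   = c(1L, 1L, 2L, 1L, 2L, 3L, 3L),
  Year      = c(2002, 2002, 2002, 2022, 2022, 2022, 2022),
  Latitude  = c(37.7, 37.7, 34.1, 37.7, 34.1, 38.6, 38.6),
  Longitude = c(-121.8, -121.8, -118.2, -121.8, -118.2, -121.7, -121.7)
)

# number of distinct monitoring sites in each year
sites_per_year <- toy[, .(n_sites = uniqueN(Site_ID)), by = Year]
print(sites_per_year)

# one row per site and year, suitable for passing to leaflet()
site_locations <- unique(toy, by = c("Site_ID", "Year"))
print(site_locations)
```

Mapping `site_locations` instead of the full table keeps the map light and avoids stacking hundreds of identical markers on the same coordinates.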

  4. Check for any missing or implausible values of PM\(_{2.5}\) in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
# checking for missing values in PM2.5
missing_PM25 <- EPA_combined[is.na(PM2.5), .N]

# checking for implausible values: negative values or values above 500 ug/m^3 (upper limit taken from 2012 EPA guidance)
implausible_PM25 <- EPA_combined[PM2.5 < 0 | PM2.5 > 500, .N]

# total number of observations
total_obs <- nrow(EPA_combined)

# calculating proportions of missing and implausible values
prop_missing <- missing_PM25 / total_obs
prop_implausible <- implausible_PM25 / total_obs

# summary of findings
cat("Total Observations:", total_obs, "\n")
Total Observations: 75732 
cat("Missing PM2.5 Values:", missing_PM25, "(", round(prop_missing * 100, 2), "% )\n")
Missing PM2.5 Values: 0 ( 0 % )
cat("Implausible PM2.5 Values:", implausible_PM25, "(", round(prop_implausible * 100, 2), "% )\n")
Implausible PM2.5 Values: 215 ( 0.28 % )
# exploring temporal patterns in missing and implausible values
missing_by_year <- EPA_combined[is.na(PM2.5), .N, by = Year]
implausible_by_year <- EPA_combined[PM2.5 < 0 | PM2.5 > 500, .N, by = Year]
# displaying the missing and implausible values by year
missing_by_year
Empty data.table (0 rows and 2 cols): Year,N
implausible_by_year
    Year     N
   <num> <int>
1:  2022   215
# examining frequency of implausible values by month
implausible_values <- subset(EPA_combined, PM2.5 < 0 | PM2.5 > 500)

# extracting month from the Date column
implausible_values$Month <- format(as.Date(implausible_values$Date), "%Y-%m")

# creating a table or summary of the count of implausible values by month
implausible_by_month <- table(implausible_values$Month)

# converting to a data frame for easier plotting or viewing
implausible_by_month_df <- as.data.frame(implausible_by_month)

# view the distribution
print(implausible_by_month_df)
      Var1 Freq
1  2022-01   23
2  2022-02   18
3  2022-03    8
4  2022-04    4
5  2022-05   12
6  2022-06   19
7  2022-07   27
8  2022-08    7
9  2022-09   21
10 2022-10    4
11 2022-11   26
12 2022-12   46
Summary of temporal patterns

The combined dataset has 75,732 observations and no missing PM\(_{2.5}\) values (0%). However, 215 observations (0.28%) are implausible, defined here as concentrations below 0 or above 500 µg/m³, an upper limit taken from 2012 EPA guidance. All 215 occurred in 2022; none were found in 2002. Because the largest value recorded in 2022 is roughly 302 µg/m³ (see the state-level summary statistics), none of them exceed the upper limit, so all 215 are negative readings. By month, the implausible values are spread throughout 2022, with the most in December (46) and July (27) and the fewest in April and October (4 each).
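
The two implausibility conditions can also be counted separately, and the monthly counts can be expressed as a share of each month's observations so that months with more data do not look worse only because of their size. The following is a minimal sketch on a made-up table (`toy`, with hypothetical dates and concentrations); the column names `flag` and `Month` are introduced here for illustration.

```r
library(data.table)

# toy stand-in for EPA_combined; values are made up for illustration
toy <- data.table(
  Date  = as.Date(c("2022-01-03", "2022-01-15", "2022-01-20",
                    "2022-07-04", "2022-07-10", "2022-12-25")),
  PM2.5 = c(-1.2, 8.0, 12.5, 510.0, 6.3, -0.4)
)

# label each observation by the kind of problem, if any
toy[, flag := fifelse(PM2.5 < 0, "negative",
               fifelse(PM2.5 > 500, "above 500", "plausible"))]

# counts of each kind of value
flag_counts <- toy[, .N, by = flag]
print(flag_counts)

# share of each month's observations that are flagged
toy[, Month := format(Date, "%Y-%m")]
monthly_share <- toy[, .(n_obs         = .N,
                         n_flagged     = sum(flag != "plausible"),
                         share_flagged = mean(flag != "plausible")),
                     by = Month]
print(monthly_share)
```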

  5. Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.
  • State
# sub-setting for California data
california_data <- EPA_combined[State == "California"]

# summary statistics for PM2.5 in California across years
summary_stats_state <- california_data %>%
  group_by(Year) %>%
  summarize(
    mean_PM2.5 = mean(PM2.5, na.rm = TRUE),
    median_PM2.5 = median(PM2.5, na.rm = TRUE),
    sd_PM2.5 = sd(PM2.5, na.rm = TRUE),
    min_PM2.5 = min(PM2.5, na.rm = TRUE),         
    max_PM2.5 = max(PM2.5, na.rm = TRUE),         
    count = n()   
  )

# printing the summary statistics
print(summary_stats_state)
# A tibble: 2 × 7
   Year mean_PM2.5 median_PM2.5 sd_PM2.5 min_PM2.5 max_PM2.5 count
  <dbl>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl> <int>
1  2002      16.1          12      13.9        0        104. 15976
2  2022       8.43          6.8     7.64      -6.7      302. 59756
# histogram of PM2.5 by year
ggplot(data = california_data) + 
  geom_histogram(aes(x = PM2.5, fill = as.factor(Year)), 
                 position = "identity", alpha = 0.6, binwidth = 2) +
  labs(title = "PM2.5 by Year in California", x = "Daily Mean PM2.5 Concentration (µg/m³)", 
       fill = "Year") +
  theme_minimal()

# boxplot of PM2.5 by year
ggplot(california_data, aes(x = as.factor(Year), y = PM2.5)) +
  geom_boxplot(fill = "pink", color = "purple", alpha = 0.7) +
  labs(title = "PM2.5 Concentrations by Year in California (2002-2022)",
       x = "Year",
       y = "Daily Mean PM2.5 Concentration (µg/m³)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# creating a summary data.table for highlighted years
highlight_years <- california_data[Year %in% c(2002, 2022), .(Mean_PM2.5 = mean(PM2.5, na.rm = TRUE)), by = Year]

# creating the line plot
ggplot(california_data, aes(x = Year, y = PM2.5)) +
  # Line plot for average PM2.5 using linewidth
  geom_line(stat = "summary", fun = mean, color = "pink", linewidth = 1) +
  # adding points for highlighted years
  geom_point(data = highlight_years, aes(x = Year, y = Mean_PM2.5), 
             size = 3, color = "purple", fill = "purple", shape = 21) +
  # adding circles around the points for emphasis
  geom_point(data = highlight_years, aes(x = Year, y = Mean_PM2.5), 
             size = 6, color = "purple", shape = 1) +
  labs(title = "Average PM2.5 Concentration Over Time in California (2002-2022)",
       x = "Year",
       y = "Average Daily Mean PM2.5 (µg/m³)") +
  theme_minimal()

Summary of observations

The data indicate a substantial decrease in daily PM\(_{2.5}\) concentrations in California from 2002 to 2022. The mean concentration dropped from 16.12 µg/m³ to 8.43 µg/m³, a reduction of about 48%, and the median fell from 12 µg/m³ to 6.8 µg/m³. The spread of values also narrowed (standard deviation 13.9 versus 7.64 µg/m³), suggesting fewer high-pollution days. 2022 still had occasional extreme events, and its maximum (about 302 µg/m³) is higher than the 2002 maximum (about 104 µg/m³), but most days showed much lower PM\(_{2.5}\) levels than in 2002. This trend is consistent with advances in air quality management and pollution control over the past two decades, although these data alone cannot establish the cause.
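
The size of the reduction follows directly from the two state-level means reported above. As a quick check of the arithmetic (the variable names are chosen here for illustration):

```r
# state-level means reported in the summary above (ug/m3)
mean_2002 <- 16.12
mean_2022 <- 8.43

# absolute and relative decrease from 2002 to 2022
abs_change <- mean_2002 - mean_2022
pct_change <- 100 * abs_change / mean_2002

cat("Absolute decrease:", round(abs_change, 2), "ug/m3\n")
cat("Relative decrease:", round(pct_change, 1), "%\n")
```

This gives a decrease of 7.69 µg/m³, or about 47.7% of the 2002 mean.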

  • County
# summary statistics for PM2.5 by counties in California across years 
summary_stats_county <- EPA_combined %>%
  group_by(County, Year) %>%
  summarize(
    mean_PM2.5 = mean(PM2.5, na.rm = TRUE),
    median_PM2.5 = median(PM2.5, na.rm = TRUE),
    sd_PM2.5 = sd(PM2.5, na.rm = TRUE),
    min_PM2.5 = min(PM2.5, na.rm = TRUE),         
    max_PM2.5 = max(PM2.5, na.rm = TRUE),         
    count = n(),                                   
    .groups = "drop"  # return an ungrouped tibble after summarizing
  ) %>%
  arrange(County, Year)

# printing the summary statistics
print(summary_stats_county)
# A tibble: 98 × 8
   County        Year mean_PM2.5 median_PM2.5 sd_PM2.5 min_PM2.5 max_PM2.5 count
   <chr>        <dbl>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl> <int>
 1 Alameda       2002      14.3          10      11.4        1.9      61.6   201
 2 Alameda       2022       8.20          7       4.95      -0.7      35.5  1793
 3 Butte         2002      14.8          11.5    11.7        1        88     473
 4 Butte         2022       6.19          4.5     5.79      -0.6      42.8  1121
 5 Calaveras     2002       9.9           8       6.50       2        40      60
 6 Calaveras     2022       6.04          5       4.10       0        25.9   355
 7 Colusa        2002      11.7           9      10.0        1        57      95
 8 Colusa        2022       7.61          6.7     4.76       0.6      37     401
 9 Contra Costa  2002      15.1           9.5    14.5        2        76.7   276
10 Contra Costa  2022       8.25          7.3     4.92       0.9      37.3   817
# ℹ 88 more rows
# ensuring 'Year' is treated as a factor
EPA_combined$Year <- as.factor(EPA_combined$Year)

# creating a bar plot of the mean PM2.5 per county and year
# (stat = "summary" averages the daily values; stat = "identity" would overlay every daily value instead)
ggplot(data = EPA_combined, aes(x = County, y = PM2.5, fill = Year)) +
  geom_bar(stat = "summary", fun = mean, position = "dodge") +
  scale_fill_manual(values = c("2002" = "turquoise", "2022" = "pink")) +
  labs(title = "PM2.5 Trends by County (2002 vs 2022)",
       x = "County",
       y = "Mean Daily PM2.5 (µg/m³)",
       fill = "Year") +
  coord_flip()

Summary of observations

From 2002 to 2022, air quality in California’s counties, as measured by PM\(_{2.5}\), generally improved. For example, mean PM\(_{2.5}\) in Alameda County fell from 14.25 µg/m³ to 8.20 µg/m³, and in Butte County from 14.76 µg/m³ to 6.19 µg/m³. Similar downward trends were observed in other counties, such as Fresno, where the mean dropped from 19.93 µg/m³ to 10.19 µg/m³. The decrease in both mean and maximum PM\(_{2.5}\) values indicates improved air quality, although variability persisted in some areas, such as Trinity and Placer counties, which still showed occasional spikes. Overall, counties across the state had lower concentrations and fewer high-pollution days in 2022 than in 2002.
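
One way to make the county comparison explicit is to reshape the county means so that each county has one row, then compute the change between the two years. The sketch below uses the means of the first three counties as printed (rounded) in the summary table above; the column names `mean_2002`, `mean_2022`, `change`, and `pct_change` are introduced here for illustration, and in the report the input would be the full `summary_stats_county` table.

```r
library(data.table)

# county means copied from the first rows of the summary table above (rounded as printed)
county_means <- data.table(
  County     = rep(c("Alameda", "Butte", "Calaveras"), each = 2),
  Year       = rep(c(2002, 2022), times = 3),
  mean_PM2.5 = c(14.3, 8.20, 14.8, 6.19, 9.9, 6.04)
)

# reshape to one row per county with one column per year
wide <- dcast(county_means, County ~ Year, value.var = "mean_PM2.5")
setnames(wide, c("2002", "2022"), c("mean_2002", "mean_2022"))

# absolute and relative change from 2002 to 2022 (negative means a decrease)
wide[, change := mean_2022 - mean_2002]
wide[, pct_change := 100 * change / mean_2002]

# counties ordered from largest decrease to smallest
setorder(wide, change)
print(wide)
```

Sorting by `change` makes it easy to see whether any county moved in the opposite direction, which a list of selected examples cannot show.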

  • Sites in Los Angeles
# sub-setting Los Angeles site data
la_data <- EPA_combined[County == "Los Angeles"]

# summary statistics for PM2.5 in Los Angeles sites across years
summary_stats_la <- la_data %>%
  group_by(Year) %>%
  summarize(
    mean_PM2.5 = mean(PM2.5, na.rm = TRUE),
    median_PM2.5 = median(PM2.5, na.rm = TRUE),
    sd_PM2.5 = sd(PM2.5, na.rm = TRUE),
    n = n()
  )

# printing the summary statistics
print(summary_stats_la)
# A tibble: 2 × 5
  Year  mean_PM2.5 median_PM2.5 sd_PM2.5     n
  <fct>      <dbl>        <dbl>    <dbl> <int>
1 2002        19.7         17.4    11.9   1879
2 2022        11.0         10.3     5.24  5070
# histogram of PM2.5 in Los Angeles sites
ggplot(la_data, aes(x = PM2.5, fill = Year)) +
  geom_histogram(binwidth = 2, color = "pink", alpha = 0.7, position = "identity") +
  labs(title = "Distribution of Daily Mean PM2.5 Concentrations at Los Angeles sites (2002-2022)",
       x = "Daily Mean PM2.5 Concentration (µg/m³)",
       y = "Frequency") +
  scale_fill_manual(values = c("2002" = "turquoise", "2022" = "pink")) + # Custom colors for each year
  theme_minimal() +
  theme(legend.position = "top")

# loading gridExtra library 
library(gridExtra)

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
# Splitting data into 2002 and 2022 subsets
LA_2002 <- subset(la_data, Year == 2002)
LA_2022 <- subset(la_data, Year == 2022)

# Rebuild each date with its year stated explicitly (Date is already of class Date, so this is only a safeguard)
LA_2002$Date <- as.Date(paste("2002", format(LA_2002$Date, "%m-%d"), sep = "-"))
LA_2022$Date <- as.Date(paste("2022", format(LA_2022$Date, "%m-%d"), sep = "-"))

# Sort each subset by date so the lines are drawn in chronological order
LA_2002 <- LA_2002[order(LA_2002$Date), ]
LA_2022 <- LA_2022[order(LA_2022$Date), ]

# Plotting PM2.5 levels for 2002
plot_2002 <- ggplot(LA_2002, aes(x = Date, y = PM2.5)) +
  geom_line(color = "turquoise") +
  geom_point(color = "turquoise") +
  scale_x_date(date_labels = "%b", date_breaks = "1 month") +  # Set month labels
  labs(title = "Change in PM2.5 in Los Angeles in 2002", x = "Month in 2002", y = "Daily Mean PM2.5 Concentration (µg/m³)") +
  theme_minimal()

# Plotting PM2.5 levels for 2022
plot_2022 <- ggplot(LA_2022, aes(x = Date, y = PM2.5)) +
  geom_line(color = "pink") +
  geom_point(color = "pink") +
  scale_x_date(date_labels = "%b", date_breaks = "1 month") +  # Set month labels
  labs(title = "Change in PM2.5 in Los Angeles in 2022", x = "Month in 2022", y = "Daily Mean PM2.5 Concentration (µg/m³)") +
  theme_minimal()

# Arrange both plots side-by-side
grid.arrange(plot_2002, plot_2022, ncol = 2)

Summary of observations

In Los Angeles County, air quality improved markedly from 2002 to 2022, as indicated by lower PM\(_{2.5}\) levels. In 2002, the mean PM\(_{2.5}\) was 19.66 µg/m³, with a median of 17.4 µg/m³ and a standard deviation of 11.88 µg/m³, based on 1,879 observations. By 2022, the mean had dropped to 10.97 µg/m³ (a reduction of about 44%), with a median of 10.3 µg/m³ and a standard deviation of 5.24 µg/m³, based on a larger set of 5,070 observations. The smaller standard deviation in 2022 points to less day-to-day variability as well as lower typical concentrations.
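
The statistics above pool all Los Angeles sites together. To look at individual sites, the same summaries can be grouped by site as well as by year. The following is a minimal sketch on a made-up table (`toy_la`, with hypothetical site IDs and concentrations); in the report it would be applied to `la_data`.

```r
library(data.table)

# toy stand-in for la_data; site IDs and concentrations are made up
toy_la <- data.table(
  Site_ID = c(101L, 101L, 101L, 101L, 102L, 102L, 102L, 102L),
  Year    = factor(c(2002, 2002, 2022, 2022, 2002, 2002, 2022, 2022)),
  PM2.5   = c(20, 24, 11, 13, 18, 16, 10, 8)
)

# mean, median, and number of observations per site and year
site_stats <- toy_la[, .(mean_PM2.5   = mean(PM2.5),
                         median_PM2.5 = median(PM2.5),
                         n            = .N),
                     by = .(Site_ID, Year)]
print(site_stats)

# sites observed in both years, which allow a like-for-like comparison
both_years <- site_stats[, .(n_years = uniqueN(Year)), by = Site_ID][n_years == 2, Site_ID]
print(both_years)
```

Restricting the comparison to sites present in both years helps separate a real change in concentrations from a change in which locations were monitored.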